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ABSTRACT 



The abstract mathematical theory of partial differential equations (PDEs) is formulated in terms of man- 
ifolds, scalar fields, tensors, and the like, but these algebraic structures are hardly recognizable in actual 
PDE solvers. The general aim of the Sophus programming style is to bridge the gap between theory and 
practice in the domain of PDE solvers. Its main ingredients are a library of abstract datatypes correspond- 
ing to the algebraic structures used in the mathematical theory and an algebraic expression style similar 
to the expression style used in the mathematical theory. Because of its emphasis on abstract datatypes, 
Sophus is most naturally combined with object-oriented languages or other languages supporting abstract 
datatypes. The resulting source code patterns are beyond the scope of current compiler optimizations, but 
are sufficiently specific for a dedicated source-to-source optimizer. The limited, domain-specific, character 
of Sophus is the key to success here. This kind of optimization has been tested on computationally intensive 
Sophus style code with promising results. The general approach may be useful for other styles and in other 
application domains as well. 
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1 Introduction 

The purpose of the Sophus approach to writ- 
ing partial differential equation (PDE) solvers 
originally proposed in |[^] is to close the gap 
between the underlying coordinate-free math- 
ematical theory and the way actual solvers are 
written. The main ingredients of Sophus are: 

1. A library of abstract datatypes corre- 
sponding to manifolds, scalar fields, ten- 
sors, and the like, figuring in the abstract 
mathematical theory. 

2. Expressions involving these datatypes 
written in a side-effect free algebraic style 
similar to the expressions in the underly- 
ing mathematical theory. 

Because of the emphasis on abstract datatypes, 
Sophus is most naturally combined with 
object-oriented languages or other languages 
supporting abstract datatypes. Hence, we 
will be discussing high-performance computing 
(HPC) optimization issues within an object- 
oriented or abstract datatype context, using 
abstractions that are suitable for PDEs. 

Sophus is not simply object-oriented scien- 
tific programming, but a much more struc- 
tured approach dictated by the underlying 
mathematics. The object-oriented numerics 
paradigm proposed in 23 1 is related to So- 



phus in that it uses abstractions correspond- 
ing to familiar mathematical constructs such 
as tensors and vectors, but these do not in- 
clude continuous structures such as manifolds 
and scalar fields. The Sophus approach is 
more properly called coordinate-free numerics 
[13]. A fully worked example of conventional 
vs. coordinate- free programming of a com- 
putational fluid dynamics problem (wire coat- 
ing for Newtonian and non-Newtonian flows) 



is given in [11]. 

Programs in a domain-specific programming 
style like Sophus may need additional opti- 



mization in view of their increased use of ex- 
pensive constructs. On the other hand the re- 
strictions imposed by the style may lead to new 
high-level optimization opportunities that can 
be exploited by dedicated tools. Automatic 
selection of high-level HPC transformations 
(especially loop transformations) has been in- 
corporated in the IBM XL Fortran compiler, 
yielding a performance improvement for entire 
programs of typically less than 2x [14, p. 239]. 
We hope Sophus style programming will allow 
high-level transformations to become more ef- 
fective than this. 

In the context of Sophus and object-oriented 
programming this article focuses on the fol- 
lowing example. Object-oriented languages en- 
courage the use of self-mutating (self-updating, 
mutative) objects rather than a side-effect free 
algebraic expression style as advocated by So- 
phus. The benefits of the algebraic style are 
considerable. We obtained a reduction in 
source code size using algebraic notation vs. 
an object-oriented style of up to 30% in se- 
lected procedures of a seismic simulation code, 
with a correspondingly large increase in pro- 
grammer productivity and maintainability of 
the code as measured by the Cocomo tech- 
nique Q, for instance. On the negative side, 
the algebraic style requires lots of temporary 
data space for (often very large) intermediate 
results to be allocated and subsequently recov- 
ered. Using self-mutating objects, on the other 
hand, places some of the burden of variable 
management on the programmer and makes 
the source code much more difficult to write, 
read, and maintain. It may yield much better 
efficiency, however. Now, by including certain 
restrictions as part of the style, a precise re- 
lationship between self-mutating notation and 
algebraic notation may be achieved. Going 
one step further, we see that the natural way 
of building a program from high-level abstrac- 
tions may be in direct conflict with the way 
current compilers optimize program code. We 
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propose a source-to-source optimization tool, 
called CodeBoost, as a solution to many of 
these problems. Some further promising opti- 
mization opportunities we have experimented 
with but not yet included in CodeBoost are 
also mentioned. The general approach may be 
useful for other styles and other application do- 
mains as well. 

This paper is organized as follows. After a 
brief overview of tensor based abstractions for 
numerical programming and their realization 
as a software library (Section |2|), we discuss 
the relationship between algebraic and self- 
mutating expression notation, and how the for- 
mer may be transformed into the latter (Sec- 
tion |3|). We then discuss the implementation 
of the CodeBoost source-to-source optimiza- 
tion tool (Section |4|), and give some further 
examples of how software construction using 
class abstractions may conflict with efficiency 
issues as well as lead to new opportunities for 
optimization (Section ||). Finally, we present 
conclusions and future work (Section ^). 

2 A Tensor Based Library for 
Solving PDEs 

Historically, the mathematics of PDEs has 
been approached in two different ways. The 
solution-oriented approach uses concrete rep- 
resentations of vectors and matrices, discreti- 
sation techniques, and numerical algorithms, 
while the abstract approach develops the the- 
ory in terms of manifolds, scalar fields, tensors, 
and the like, focusing more on the structure of 
the underlying concepts than on how to calcu- 
late with them (see [|15|] for a good introduc- 
tion). 

The former approach is the basis for most of 
the PDE software in existence today. The lat- 
ter has very promising potential for the struc- 
turing of complicated PDE software when com- 
bined with template class based programming 



languages or other languages supporting ab- 
stract datatypes. As far as notation is con- 
cerned, the abstract mathematics makes heavy 
use of overloaded infix operators. Hence, user- 
definable operators and operator overloading 
are further desirable language features in this 



application domain. C++ [18| comes closest to 
meeting these desiderata, but, with modules 
and user-definable operators, Fortran 90/95 
| II Ej can also be used. In its current form Java 
1 10f| is less suitable. It has neither templates 
nor user-definable operators. Also, Java's au- 
tomatic memory management is not necessar- 
ily an advantage in an HPC setting Sec- 
tion 4]. Some of these problems may be allevi- 
ated in Generic Java [|| . The examples in this 
article use C++. 

2.1 The Sophus Library 

The Sophus library provides the abstract 
mathematical concepts from PDE theory as 
programming entities. Its components are 
based on the notions of manifold, scalar field 
and tensor field, while the implementations 
are based on the conventional numerical algo- 
rithms and discretisations. Sophus is currently 
structured around the following concepts: 

• Basic n-dimensional mesh structures. 
These are like rank n arrays (i.e., with 
n independent indices), but with opera- 
tions like +, — and * mapped over all ele- 
ments (much like Fortran 90/95 array op- 
erators) as well as the ability to add, sub- 
tract or multiply all elements of the mesh 
by a scalar in a single operation. There 
are also operations for shifting meshes in 
one or more dimensions. Parallel and se- 
quential implementations of mesh struc- 
tures can be used interchangeably, allow- 
ing easy porting between architectures of 
any program built on top of the mesh ab- 
straction. 
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• Manifolds. These represent the physi- 
cal space where the problem to be solved 
takes place. Currently Sophus only imple- 
ments subsets of R n . 

• Scalar fields. These may be treated for- 
mally as functions from manifolds to re- 
als, or as arrays indexed by the points 
of the manifold with reals as data ele- 
ments. Scalar fields describe the measur- 
able quantities of the physical problem to 
be solved. As the basic layer of "con- 
tinuous mathematics" in the library, they 
provide the partial derivation operations. 
Also, two scalar fields on the same man- 
ifold may be pointwise added, subtracted 
or multiplied. The different discretisation 
methods provide different designs for the 
implementation of scalar fields. A typical 
implementation would use an appropriate 
mesh as underlying discrete data struc- 
ture, use interpolation techniques to give 
a continuous interface, and map the +, 
— , and * operations directly to the corre- 
sponding mesh operations. In a finite dif- 
ference implementation partial derivatives 
are implemented using shifts and arith- 
metic operations on the mesh. 

• Tensors. These are generalizations of vec- 
tors and matrices and have scalar fields 
as components. Tensors define the gen- 
eral differentiation operations based on 
the partial derivatives of the scalar fields, 
and also provide operations such as com- 
ponentwise addition, subtraction and mul- 
tiplication, as well as tensor composi- 
tion and application (matrix multiplica- 
tion and matrix- vector multiplication). A 
special class are the metric tensors. These 
satisfy certain mathematical properties, 
but their greatest importance in this con- 
text is that they can be used to define 
properties of coordinate systems, whether 



Cartesian, axiosymmetric or curvilinear, 
allowing partial differential equations to 
be formulated in a coordinate-free way. 
The implementation of tensors relies heav- 
ily on the arithmetic operations of the 
scalar field classes. 

A partial differential equation in general pro- 
vides a relationship between spatial derivatives 
of tensor fields representing physical quantities 
in a system and their time derivatives. Given 
constraints in the form of the values of the ten- 
sor fields at a specific instance in time together 
with boundary conditions, the aim of a PDE 
solver is to show how the physical system will 
evolve over time, or what state it will converge 
to if left by itself. Using Sophus, the solvers are 
formulated on top of the coordinate- free layer, 
forming an abstract, high level program for the 
solution of the problem. 



The algebraic style for function declarations 
can be seen in Figure |l|, which shows spec- 
ifications of some operations for multidimen- 
sional meshes, the lowest level in the Sophus 
library. The mesh class is parameterized by 
a class T, so all operations on meshes like- 
wise are parameterized by T. Typical param- 
eters would be a float or scalar field class. The 
operations declared are defined to behave like 
pure functions, i.e., they do not update any in- 
ternal state or modify any of their arguments. 
Such operations are generally nice to work with 
and reason about, as their application will not 
cause any hidden interactions with the envi- 
ronment. 

Selected parts of the implementation of a 
continuous scalar field class are shown in Fig- 
ure This scalar field represents a multi- 
dimensional torus, and is implemented using 
a mesh class as the main data structure. The 
operations of the class have been implemented 



2.2 Sophus Style Examples 
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/** returns the mesh circularly shifted {i} positions in dimension {d} */ 
template<class T> Mesh<T> shift(const Mesh<T> & M, int d, int i) ; 

/** returns the elementwise sum of {lhs} and {rhs} */ 

template<class T> Mesh<T> operator+ (const Mesh<T>& lhs, const Mesh<T>& rhs) ; 

/** returns the elementwise difference of {lhs} and {rhs} */ 
template<class T> Mesh<T> operator- (const Mesh<T>& lhs, const Mesh<T>& rhs); 

/** returns the elementwise product of the {lhs} and {r} */ 
template<class T> Mesh<T> operator* (const Mesh<T>& lhs, const realfe r) ; 



Figure 1: Specification of algebraic style operators on a mesh template class. 



as self- mutating operations (Section ||), but are 
used in an algebraic way for clarity. It is easy 
to see that the partial derivation operation is 
implemented by shifting the mesh longer and 
longer distances, and gradually scaling down 
the impact these shifts have on the derivative, 
yielding what is known four-point, finite 
difference, partial derivation algorithm. The 
addition and multiplication operations are im- 
plemented using the element-wise mapped op- 
erations of the mesh. 

The meshes used in a scalar field tend to 
be very large. A TorusScalarField may typ- 
ically contain between 0.2 and 2MB of data, 
perhaps even more, and a program may con- 
tain many such variables. The standard trans- 
lation technique for a C++ compiler is to gen- 
erate temporary variables containing interme- 
diate results from subexpressions, adding a 
considerable run-time overhead to the alge- 
braic style of programming. An implementa- 
tion in terms of self-mutating operators might 
yield noticeable efficiency gains. For the addi- 
tion, subtraction and multiplication algorithms 
of Figure ^ a self-mutating style is easily ob- 
tained. The derivation algorithm will require 



extensive modification, such as shown in Fig- 
ure ||, with a marked deterioration in readabil- 
ity and maintainability as a result. 

3 Algebraic Notation and 
Self-Mutating Implementa- 
tion 

3.1 Self-Mutating Operations 

Let a, b and c be meshes with operators as 
defined in Figure |l|. The assignment 

c = a * 4.0 + b + a 

is basically evaluated as 

tempi = a * 4.0; 
temp2 = tempi + b; 
c = temp2 + a. 

This involves the creation of the meshes tempi, 
temp2, c, the first two of which are temporary. 
Obviously, since all three meshes have the same 
size and the operations in question are suffi- 
ciently simple, repeated use of a single mesh 
would have been possible in this case. In fact, 
for predefined types like integers and floats an 
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/** some operations on a scalar field implemented using the finite difference method 
*/ 

class TorusScalarField { 
private : 

Mesh<real> msf ; // data values for each grid point of the mesh 
real delta; // resolution, distance between grid points 



public: 



/** 4 point derivation algorithm, computes partial derivative in dimension d */ 
void uderiv (int d) 

{ Mesh<real> ans = (shift (msf ,d, 1) - shift (msf ,d, -1) ) * 0.85315148548241; 

ans = ans + (shift (msf ,d, 2) - shift (msf ,d, -2) ) * -0.25953977340489; 
ans = ans + (shift (msf ,d, 3) - shift (msf ,d, -3) ) * 0.06942058732686; 
ans = ans + (shift (msf ,d, 4) - shift (msf ,d, -4) ) * -0.01082798602277; 
msf = ans * (1/delta) ; 

> 

/** adding scalar field {rhs}- to this TorusScalarField */ 
void operator+= (const TorusScalarFieldfe rhs); 
{ msf = msf + rhs; 
> 

/** subtracting scalar field {rhs} from this TorusScalarField */ 
void operator-= (const TorusScalarFieldfe rhs); 
{ msf = msf - rhs; 
> 

/** multiplying scalar {r} to this TorusScalarField */ 
void operator*= (const realft r) ; 
{ msf = msf * r; 
> 



} 



Figure 2: A class TorusScalarField with self-mutating implementations of a partial derivation 
algorithm, a scalar field addition, and a scalar multiplication algorithm. The code itself is using 
algebraic notation for the mesh operations. 
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optimizing C or C++ compiler would trans- 
late the expression to a sequence of compound 
assignments^ 

c = a; c*=4.0; c+=b; c+=a, 

which repeatedly uses variable c to store inter- 
mediate results. 

We would like to be able to do a similar 
optimization of the mesh expression above as 
well as other expressions involving n-ary op- 
erators or functions of a suitable nature for 
user-defined types as proposed in ||. In an 
object-oriented language, it would be natural 
to define self-mutating methods (i.e., methods 
mutating this) for the mesh operations in the 
above expression. These would be closely sim- 
ilar to the compound assignments for prede- 
fined types in C and C++, which return a 
pointer to the modified data structure. Sophus 
demands a side-effect free expression style close 
to the underlying mathematics, however, and 
forbids direct use of self-mutating operations 
in expressions. Note that with a self- mutating 
+= operator returning the modified value of its 
first argument, the expression (a += b) += a 
would yield 2{a + b) rather than (2a) + b. 

By allowing the user to define self-mutating 
operations and providing a way to use them 
in a purely functional manner, their direct use 
can be avoided. There are basically two ways 
to do this, namely, by means of wrapper func- 
tions or by program transformation. These will 
be discussed in the following sections. 

3.2 Wrapper Functions 

Self-mutating implementations can be made 
available to the programmer in non-self- 
mutating form by generating appropriate 
wrapper functions. We developed a C++ pre- 
processor SCC doing this. It scans the source 

1 Not to be confused with the C notion of compound 
statement, which is a sequence of statements enclosed 
by a pair of braces. 



text for declarations of a standard form and au- 
tomatically creates wrapper functions for the 
self-mutating ones. This allows the use of an 
algebraic style in the program, and relieves the 
programmer of the burden of having to code 
the wrappers manually. 

A self-mutating operator op= is related to its 
algebraic analog op by the basic rule 

x = y op z; = x = copy(y) ; x op= z; (1) 

or, if the second argument is the one being up- 
dated,0 by the rule 

x = y op z; = x = copy(z) ; y op= x; (2) 

where = denotes equivalence of the left- and 
right-hand sides, x, y, z are C++ variables, 
and copy makes a copy of the entire data struc- 
ture. Now, the Sophus style does not allow 
aliasing or sharing of objects, and the (over- 
loaded) assignment operator x = y is always 
given the semantics of x = copy(y) as used in 
(1) and (2). Hence, in the context of Sophus 
(1) can be simplified to 

x=yopz; =x=y; x op= z; (3) 



and similarly for (2). We note the special case 
x = x op z; = x op= z; (4) 



and the obvious generalizations 

x = x op e; = x op= e; (5) 

x = el op e2; = x = el; x op= e2; (6) 



2 This does not apply to built-in compound assign- 
ments in C or C++, but user-defined compound assign- 
ments in C++ may behave in this way. 
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/** implements the basic mesh operations */ 
template<class T> class MeshCodel{ 



public : 

/** circularly shifts {this} mesh {i} positions in dimension {d} */ 
void ushift(int d, int i){ ... } 

/** adds {rhs} elementwise to -[this} mesh */ 
void operator+=(const MeshCodel<T> & rhs){ ... } 

/** subtracts {rhs} elementwise from {this} mesh */ 
void operator-=(const MeshCodel<T> & rhs){ ... } 

/** multiplies {this} mesh elementwise by {r} */ 
void operator*=(real r){ ... } 

} 



Figure 3: The use of self-mutating membership operations for a mesh class MeshCodel. 



template<class T> MeshCodel<T> shift (const MeshCodel<T> & MD, int d, int i) 

{ MeshCodel<T> C = MD; C .ushif t (d, i) ; return C; } 
template<class T> MeshCodel<T> operator* (const MeshCodel<T>& lhs , const MeshCodel<T>& rhs); 

{ MeshCodel<T> C = lhs; C += rhs; return C; } 
template<class T> MeshCodel<T> operator- (const MeshCodel<T>& lhs, const MeshCodel<T>& rhs); 

{ MeshCodel<T> C = lhs; C -= rhs; return C; } 
template<class T> MeshCodel<T> operator* (const MeshCodel<T>& lhs, const realfe r) ; 

{ MeshCodel<T> C = lhs; C *= r; return C; } 



Figure 4: Wrapper functions implementing the specification of a mesh using MeshCodel opera- 
tions generated by the SCC preprocessor. 
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where e, el, and e2 are expressions. SCC uses 
rules such as (6) to obtain purely functional be- 
havior from the self-mutating definitions in a 
straightforward way. Figure ||| shows the wrap- 
pers created by SCC for the self-mutating mesh 
operations of Figure ||. The case of n-ary oper- 
ators and functions is similar (n > 1). We note 
that, unlike C and C++ compound assign- 
ments, Sophus style self-mutating operators do 
not return a reference to the variable being up- 
dated and cannot be used in expressions. This 
simpler behavior facilitates their definition in 
Fortran 90/95 and other languages of interest 
to Sophus. 

The wrapper approach is superficial in that 
it does not minimize the number of tempo- 
raries introduced for expression evaluation as 



illustrated in Section 3.1. We therefore turn to 



a more elaborate transformation scheme. 

3.3 Program Transformation 

Transformation of algebraic expressions to self- 
mutating form with simultaneous minimiza- 
tion of temporaries requires a parse of the 
program, the collection of declarations of self- 
mutating operators and functions, and match- 
ing them with the types of the operators and 
functions actually used after any overloading 
has been resolved. Also, declarations of tempo- 
raries have to be added with the proper type. 
Such a preprocessor would be in a good po- 
sition to perform other source-to-source opti- 
mizations as well. In fact, this second ap- 
proach is the one implemented in CodeBoost 
with promising results. 

Figure [5] shows an optimized version 
of the partial derivation operator of class 
TorusScalarField (Figure ||) that might be 
obtained in this way. In addition to the trans- 
formation to self-mutating form, an obvious 
rule for ushift was used to incrementalize 
shifting of the mesh. 

Assuming the first argument is the one being 



template<class T> void F (T & x) 
{ x = x*x + x*2.0; } 

template<class T> void P (T & x) 
{ T tempi = x; 

tempi *= 2.0 ; 

x *= x; 

x += tempi; 

} 



Figure 6: Kernels F and P. 

updated, some further rules for binary opera- 
tors used in this stage are 

x opl= el op2 e2; = 

{T t = el; t op2= e2; x opl= t;} (7) 

{T tl = el; sl;}{T t2 = e2; s2;} = 

{T t = el; si; t = e2; s2;}. (8) 

Here x, t, tl, t2 are variables of type T; el, 
e2 are expressions; and self-mutating opera- 
tors op=, opl=, op2= correspond to operators 
op, opl, op2, respectively. Recall that Sophus 
does not allow aliasing. Rule (7) introduces 
a temporary variable t in a local environment 
and rule (8) reduces the number of temporary 
variables by merging two local environments 
declaring a temporary into a single one. 

3.4 Benchmarks 
3.4.1 Two Kernels 

Consider C++ procedures F and P shown in 
Figure ||. F computes x 2 + 2x using algebraic 
notation while P computes the same expression 
in self-mutating form using a single temporary 
variable tempi. Both were run with meshes 
of different sizes. The corresponding timing 
results are shown in Figures 0, ||, and ||. 
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/** some operations on a scalar field implemented using the finite difference 
method 

*/ 

public class ScalarField { 

MeshCodel msf(); // data values for each grid point of the mesh 
real delta; // resolution, distance between grid points 



/** 4 point derivation algorithm, computes partial derivative in dimension d */ 
public void uderiv (int d) 
{ MeshCodel msa = msf ; 

MeshCodel msb = msf; 

MeshCodel scratchO ; 

msa.ushif t (d, 1) ; 
msb .ushif t (d, -1) ; 

scratch = msa; scratch. uminus (msb) ; scratch. umult (0 . 85315148548241) ; 
msf = scratch; 

msa.ushif t (d, 1) ; 
msb .ushif t (d, -1) ; 

scratch = msa; scratch. uminus (msb) ; scratch. umult (-0. 25953977340489) ; 
msf .uplus (scratch) ; 

msa.ushif t (d, 1) ; 
msb .ushif t (d, -1) ; 

scratch = msa; scratch. uminus (msb) ; scratch. umult (0 . 06942058732686) ; 
msf .uplus (scratch) ; 

msa.ushif t (d, 1) ; 
msb .ushif t (d, -1) ; 

scratch = msa; scratch. uminus (msb) ; scratch. umult (-0. 01082798602277) ; 
msf .uplus (scratch) ; 

msf .umult (1/delta) ; 

} 



Figure 5: Optimized partial derivation operator of class TorusScalarField (Figure (2j). 
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Figure 7: Speed of conventional vs. Sophus style on SUN spare Ultra-2 workstation. More 
specifically, a SunOS 5.6 Generic-105181-06 sun^u spare SUNW, Ultra-2 hardware platform 
with 512MB internal memory and the SunSoft C++ compiler CC: Workshop Compilers 4-2 30 
Oct 1996 C++ 4.2 were used. 
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Figure 8: Speed of conventional vs. Sophus style on Silicon Graphics/Cray Origin 2000. More 
specifically, the Origin 2000 had hardware version IRIX64 ask 6.5SE 03250013 IP27 with a 
total of 24GB memory distributed among 128 processors. The C++ compiler used was MlPSpro 
Compilers: Version 7.2.1.1m. 
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Figure 9: Speed of conventional vs. Sophus style on SUN spare Ultra-2 workstation for small 
meshes. 



The mesh size is given in the leftmost col- 
umn. Mesh elements are single precision reals 
of 4B each. The second column indicates the 
benchmark procedure (F or P) or the ratio of 
the corresponding timings (F/P). The columns 
NC, NS, OC, and OS give the time in seconds 
of several iterations over each mesh so that a 
total of 16 777 216 elements were updated in 
each case. This corresponds to 32 768 itera- 
tions for mesh size 8 3 , 64 iterations for mesh 
size 64 3 , 1 iteration for mesh size 256 3 , and so 
forth. In columns _C (conventional style) the 
procedure calls are performed for each element 
of the mesh, while in columns _S (Sophus style) 
they are performed as operations on the entire 
mesh variables. 

Columns N_ give the time for unoptimized 
code (no compiler options), while columns 0_ 
give the time for code optimized for speed 
(compiler option -fast for the SUN CC com- 
piler and -Of ast for the Silicon Graphics/Cray 
CC compiler). The timings represent the me- 
dian of 5 test runs. These turned out to be rela- 
tively stable measurements, except in columns 
NS and OS, rows 256 3 F and P of Figure @, 



where the time for an experiment could vary 
by more than 100%. This is probably due to 
paging activity on disk dominating the actual 
CPU time. Also note that the transformations 
done by the optimizer are counterproductive in 
the P case, yielding an NS/OS ratio of 0.8. 

When run on the SUN the tests where the 
only really active processes, while the Cray was 
run in its normal multi-user mode but at a rel- 
atively quiet time of the day (Figure |l0|). As 
can be seen the load was moderate (around 
58) and although fully utilized, resources where 
not overloaded. 

In the current context, only columns NS 
and OS are relevant, the other ones are ex- 
plained in Section 15.11. As expected, the self- 



mutative form P is a better performer than 
the algebraic form F when the Sophus style is 
used. Disregarding the cases with disk pag- 
ing mentioned above, we see that the self- 
mutating mesh operations are 1.8-2.4 times 
faster than their algebraic counterparts, i.e., 
the CodeBoost transformation roughly doubles 
the speed of these benchmarks. 
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IRIX64 ask 6 . 5SE IP27 

load averages: 58.37 57.74 58.30 06:46:21 
385 processes: 323 sleeping, 3 stopped, 

1 ready, 58 running 
128 CPUs: 0.07, idle, 0.07, usr, 0.07, ker, 
0.07, wait, 0.07, xbrk, 0.07, intr 
Memory: 24G max, 23G avail, 709M free, 
25G swap, 17G free swap 



Figure 10: Random load information for test 
run on Silicon Graphics/Cray Origin 2000. 

3.4.2 Full Application: SeisMod 

We also obtained preliminary results on the 
Silicon Graphics/Cray Origin 2000 for a full 
application, the seismic simulation code Seis- 
Mod, which is written in C++ using the So- 
phus style. It is a collection of applications 
using the finite difference method for seismic 
simulation. Specific versions of SeisMod have 
been tailored to handle simulations with simple 
or very complex geophysical properties]^ We 
compared a version of SeisMod implemented 
using SCC generated wrapper functions and a 
self-mutating version produced by the Code- 
Boost source-to-source optimizer: 

• The algebraic expression style version 
turned out to give a 10-30% reduction 
in source code size and greatly enhanced 
readability for complicated parts of the 
code. This implies a significant program- 
mer productivity gain as well as a sig- 
nificant reduction in maintenance cost as 
measured by the Cocomo technique 0] , for 
instance 

• A 30% speed increase was obtained af- 
ter 10 selected procedures out of 150 
procedures with speedup potential had 



been brought in self-mutating form. This 
speedup turned out to be independent of 
C++ compiler optimization flag settings. 

This shows that although a more user- 
friendly style may give a performance penalty 
compared to a conventional style, it is possible 
to regain much of the efficiency loss by using 
appropriate optimization tools. Also, a more 
abstract style may yield more cost-effective 
software, even without these optimizations, if 
the resulting development and maintenance 
productivity improvement is taken into ac- 
count. 

4 Implementation of Code- 
Boost 

CodeBoost is a dedicated C++ source-to- 
source transformation tool for Sophus style 
programs. It has been implemented using the 



3 SeisMod is used and licensed by the geophysical 
modelling company UniGEO A.S. (Bergen, Norway). 



ASF+SDF language prototyping system 1 20 ] . 
ASF+SDF allows the required transformations 
to be entered directly as conditional rewrite 
rules whose right- and left-hand sides consist 
of language (in our case C++) patterns with 
variables and auxiliary transformation func- 
tions. The required language specific parsing, 
rewriting, and prettyprinting machinery is gen- 
erated automatically by the system from the 
high-level specification. Program transforma- 
tion tools for Prolog and the functional lan- 
guage Clean implemented in ASF+SDF are de- 
scribed in @,[1(|. 

An alternative implementation tool would 
have been the TAMPR program transforma- 
tion system f|, which has been used success- 
fully in various HPC applications. We pre- 
ferred ASF+SDF mainly because of its strong 
syntactic capabilities enabling us to generate 
a C++ environment fairly quickly given the 
complexity of the language. 

Another alternative would have been the use 
of template metaprogramming and/or expres- 
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sion templates |2T], ^] . This approach is highly 
C++ specific, however, and cannot be adapted 
to Fortran 90/95. 

Basically, the ASF+SDF implementation of 
CodeBoost involves the following two steps: 

1. Specify the C++ syntax in SDF, the 
syntax definition formalism of the system. 



2. Specify the required transformation rules 
as conditional rewrite rules using the C++ 
syntax, variables, and auxiliary transfor- 
mation functions. 

As far as the first step is concerned, specifi- 
cation of the large C++ syntax in SDF would 
involve a considerable effort, but fortunately a 
BNF-like version is available from the ANSI 
C++ standards committee. We obtained a 
machine-readable preliminary version @ and 
translated it largely automatically into SDF 
format. The ASF+SDF language prototyp- 
ing system then generated a C++ parser from 
it. The fact that the system accepts general 
context-free syntax rather then only LALR or 
other restricted forms of syntax also saved a 
lot of work in this phase even though the size 
of the C++ syntax taxed its capabilities. 

With the C++ parser in place, the required 
program transformation rules were entered as 
conditional rewrite rules. In general, a pro- 
gram transformer has to traverse the syntax 
tree of the program to collect the context- 
specific information used by the actual trans- 
formations. In our case, the transformer needs 
to collect the declaration information indi- 
cating which of the operations have a self- 
mutating implementation. Also, in Sophus the 
self-mutating implementation of an operator 
(if any) need not update this but can indicate 
which of the arguments is updated using the 
upd flag. The transformer therefore needs to 
collect not only which of the operations have 
a self-mutating implementation but also which 



argument is being mutated in each case. As 
a consequence, CodeBoost has to traverse the 
program twice: once to collect the declaration 
information and a second time to perform the 
actual transformations. Two other issues have 
to be taken into account: 

• C++ programs cannot be parsed be- 
fore their macros are expanded. Some 
Sophus-specific language elements are im- 
plemented as macros, but are more eas- 
ily recognized before expansion than af- 
ter. An example is the upd flag indicating 
which argument of an operator or function 
is the one to be updated. 

• Compared to the total number of con- 
structs in C++, there are relatively few 
constructs of interest. Since all constructs 
have to be traversed, this leads to a 
plethora of trivial tree traversal rules. As 
a result, the specification gets cluttered up 
by traversal rules, making it a lot of work 
to write as well as hard to understand. 
One would like to minimize or automat- 
ically generate the part of the specifica- 
tion concerned with straightforward pro- 
gram traversal. 

Our approach to the above problems is to 
give the specification a two-phase structure as 
shown in Figure 11. Under the reasonable as- 



sumption that the declarations are not spoiled 
by macros, the first phase processes the dec- 
larations of interest prior to macro expansion 
using a stripped version of the C++ gram- 
mar that captures the declaration syntax only. 
We actually used a Perl script for this, but it 
could have been done in ASF+SDF as well. It 
yields an ASF+SDF specification that is added 
to the specification of the second phase. The 
effect of this is that the second phase is spe- 
cialized for the program at hand in the sense 
that the transformation rules in the second 
phase can assume the availability of the dec- 
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Figure 11: Two-phase specification of CodeBoost. 



laration information and thus can be specified 
in a more algebraic, i.e., context independent 
manner. As a consequence, they are easy to 
read, consisting simply of the rules for the con- 
structs that may need transformation and us- 
ing the ASF+SDF system's built-in innermost 
tree traversal mechanism. In this way, we cir- 
cumvented the last-mentioned problem. 

As CodeBoost is developed further, it will 
have to duplicate more and more functions 
already performed by any C++ preproces- 
sor/compiler. Not only will it have to do pars- 
ing (which it is already doing now), but also 
template expansion, overloading resolution, 
and dependence analysis. It would be helpful 
if CodeBoost could tap into an existing com- 
piler at appropriate points rather than redo ev- 
erything itself. One of the candidates we are 
considering is the IBM Montana C++ com- 
piler/programming environment [17], which 
provides an open architecture with APIs giving 



access to various compiler intermediate rep- 
resentations with pointers back to the source 
text. 



5 Software Structure vs. Effi- 
ciency 

As noted in Section [I], programs in a domain- 
specific programming style like Sophus may 
need additional optimization in view of their 
increased use of expensive constructs. On the 
other hand, the restrictions imposed by the 
style may lead to new high-level optimiza- 
tion opportunities that can be exploited by 
a CodeBoost-like optimization tool. We give 
some further examples of both phenomena. 
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5.1 Inefficiencies Caused by the Use 
of an Abstract Style 

We consider an example. As explained in Sec- 
tion ^Tl], scalar field operations like + and * are 
implemented on top of mesh operations + and 
*. The latter will typically be implemented as 
iterations over all array elements, performing 
the appropriate operations pairwise on the el- 
ements. For scalar fields, expressions like 
Xi = A 1A * Vi + A lj2 * V 2 , 
X 2 = A 2) i *V\+ A 2)2 * V 2 
will force 8 traversals over the mesh data struc- 
ture. If the underlying meshes are large, this 
may cause many cache misses for each traver- 
sal. Now each of the scalar fields Aij, Vj, and 
Xj are actually implemented using a mesh, i.e., 
an array of n elements, and are represented in 
the machine by A [i , j , k] , V [ j , k] and X [ j , k] 
for k = 1, . . . , K, where K is the number of mesh 
points of the discretisation. In a conventional 
implementation this would be explicit in the 
code more or less as follows: 

for k := 1,K 

for j := 1,2 

X[j,k] := 
for i := 1,2 

X[j,k] += A[i,j,k]*V[j,k] 
endfor 
endfor 
endfor 

It would be easy for an optimizer to partition 
the loops in such a way that the number of 
cache misses is reduced by a factor of 8. 

In the abstract case aggressive in-lining is 
necessary to expose the actual loop nesting 
to the optimizer. Even though most existing 
C++ compilers do in-lining of abstractions, 
this would be non-trivial since many abstrac- 
tion layers are involved from the programmer's 
notation on top of the library of abstractions 
down to the actual traversals being performed. 



Consider once again the timing results 
shown in Figure [7|, Figure ||, and Figure ^. 
As was explained in Section |3.4| , the proce- 
dure calls in columns _C (conventional style) 
are performed for each element of the mesh, 
while they are performed as operations on the 
entire mesh variables in columns _S (Sophus 
style). Columns OS/OC for row P give the rel- 
evant figures for the performance loss of opti- 
mized Sophus style code relative to optimized 
conventional style code as a result of Sophus 
operating at the mesh level rather than at the 
element level. The benchmarks show a penalty 
of 1.1-5.3, except for data structures of less 
than 128kB on the SUN, where a speedup of up 
to 1.4 (penalty of 0.7) can be seen in Figure [| 
As is to be expected, for large data structures 
the procedure calls in column OC are more effi- 
cient than those in column OS, as the optimizer 
is geared towards improving the conventional 
kind of code consisting of large loops with pro- 
cedure calls on small components of data struc- 
tures. Also, cache and memory misses become 
very costly when large data structures have to 
be traversed many times. 

The figures for P in column OS of Figure ^ 
are somewhat unexpected. In these cases OS is 
the fastest alternative up to a mesh size some- 
where between 32 3 and 64 3 . This may be due 
to the smaller number of procedure calls in the 
OS case than in the OC case. In the latter 
case F and P are called once per element, i.e., 
16 777 216 times, while in the OS case they are 
called only once and the self-mutating opera- 
tions are called only 4 times. 

Another interesting phenomenon can be seen 
in column NC of Figure ^ and Figure |8[ Here 
the self-mutating version takes longer than the 
algebraic version, probably because the com- 
piler automatically puts small temporaries in 
registers for algebraic expressions, but cannot 
do so for self-mutating forms. The OC column 
shows that the optimizer eliminates the differ- 
ence. 
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5.2 New Opportunities for Opti- 
mization 

The same abstractions that were a source of 
worry in the previous section at the same time 
provide the specificity and typing making the 
use of high-level optimizations possible. Before 
they are removed by inlining, the information 
the abstractions provide can be used to select 
and apply appropriate datatype specific opti- 
mization rules. Sophus allows application of 
such rules at very high levels of abstraction. 
Apart from the expression transformation rules 
(l)-(8) (Section ||), which are applicable to a 
wide range of operators and functions, further 
examples at various levels of abstraction are: 

• The laws of tensor algebra. In Sophus the 
tensors contain the continuous scalar fields 
as elements (Section |2.1[) , thus making the 
abstract tensor operations explicit in ap- 
propriate modules. 

• Specialization of general tensor code for 
specific coordinate systems. A Cartesian 
coordinate system gives excellent simplifi- 
cation and axiosymmetric ones also give 
good simplification compared to general 
curvilinear code. 

• Optimization of operations on matrices 
with many symmetries. Such symme- 
tries offer opportunities for optimization 
in many cases, including the seismic 
modelling application mentioned in Sec- 
tion ggg . 
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Conclusions 
Work 



The Sophus class library in conjunction 
with the CodeBoost expression transfor- 
mation tool shows the feasibility of a 
style of programming PDE solvers that 
attempts to stay close to the abstract 



mathematical theory in terms of both the 
datatypes and the algebraic style of ex- 
pressions used. 

• Our preliminary findings for a full ap- 
plication, the Sophus style seismic simu- 
lation code SeisMod, indicate significant 
programmer productivity gains as a result 
of adopting the Sophus style. 

• There are numerous further opportunities 
for optimization by CodeBoost in addi- 
tion to replacement of appropriate oper- 
ators and functions by their self-mutating 
versions. Sophus allows datatype specific 
rules to be applied at very high levels of 
abstraction. 
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